Investigating Spectral Amplitude Modulation Phase Hierarchy Features in Speech Synthesis

نویسندگان

  • Alexandros Lazaridis
  • Milos Cernak
  • Pierre-Edouard Honnet
  • Philip N. Garner
چکیده

In our recent work, a novel speech synthesis with enhanced prosody (SSEP) system using probabilistic amplitude demodulation (PAD) features was introduced. These features were used to improve prosody in speech synthesis. The PAD was applied iteratively for generating syllable and stress amplitude modulations in a cascade manner. The PAD features were used as a secondary input scheme along with the standard text-based input features in deep neural network (DNN) speech synthesis. Objective and subjective evaluation validated the improvement of the quality of the synthesized speech. In this paper, a spectral amplitude modulation phase hierarchy (S-AMPH) technique is used in a similar to the PAD speech synthesis scheme, way. Instead of the two modulations used in PAD case, three modulations, i.e., stress-, syllableand phoneme-level ones (2, 5 and 20 Hz respectively) are implemented with the S-AMPH model. The objective evaluation has shown that the proposed system using the S-AMPH features improved synthetic speech quality in respect to the system using the PAD features; in terms of relative reduction in mel-cepstral distortion (MCD) by approximately 9% and in terms of relative reduction in root mean square error (RMSE) of the fundamental frequency (F0) by approximately 25%. Multi-task training is also investigated in this work, giving no statistically significant improvements.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Efficient Hierarchical Modulation based Orthogonal Frequency Division Multiplexing Transmission Scheme for Digital Video Broadcasting

Due to the increase of users the efficient usage of spectrum plays an important role in digital terrestrial television networks. In digital video broadcasting, local and global content are transmitted by single frequency network and multifrequency network respectively. Multifrequency network support transmission of global content and it consumes large spectrum. Similarly local content are well ...

متن کامل

Speech analysis and synthesis using an AM-FM modulation model

In this paper, the AM{FM modulation model is applied to speech analysis, synthesis and coding. The multiband demodulation pitch tracking algorithm is proposed that produces smooth and accurate fundamental frequency contours. The AM{ FM modulation vocoder represents speech as the sum of resonance signals modeled by their amplitude envelope and instantaneous frequency signals. E cient modeling an...

متن کامل

Acoustic-Emergent Phonology in the Amplitude Envelope of Child-Directed Speech.

When acquiring language, young children may use acoustic spectro-temporal patterns in speech to derive phonological units in spoken language (e.g., prosodic stress patterns, syllables, phonemes). Children appear to learn acoustic-phonological mappings rapidly, without direct instruction, yet the underlying developmental mechanisms remain unclear. Across different languages, a relationship betwe...

متن کامل

The Function of Pitch Range Variations in Samples of Emotional Expressions in Persian

This study aims at investigating the interface between emotion and intonation patterns (more specifically, duration and pitch amplitude of speech). To this end, the acoustic properties of spectral parameters related to speech prosody are investigated. The results of acoustic and Statistical analysis show that mean level and range of FO in the contours vary strongly as a function of the degree o...

متن کامل

The relation between speech intelligibility and the complex modulation spectrum

The amplitude and phase components of the modulation spectrum were dissociated in order to ascertain the importance of cross-spectral, envelope-modulation phase information for understanding spoken language. The dissociation was effected via local time reversals of the speech waveform (i.e., flipping the signal on its horizontal axis) at intervals ranging between 0 and 180 ms. Intelligibility d...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016